INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellation include changes of plans and scheduling conflicts. Canceling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

The Booking ID is essentially a row identifier: every value is unique, so there is nothing to group by, and the column can be excluded from this data processing exercise. The other object columns are type of meal plan, room type reserved, market segment type, and booking status (our dependent variable). Each of these columns has a small, fixed set of categories that we can use for data processing, so these object columns will be converted to the category datatype below.
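The conversion described above can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual cell; the column names (Booking_ID, type_of_meal_plan, booking_status, lead_time) and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the bookings data; column names and values are assumed.
df = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003"],
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 1"],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"],
    "lead_time": [224, 5, 1],
})

# The ID is unique per row, so it carries no information for modeling.
df = df.drop(columns=["Booking_ID"])

# Convert the low-cardinality text columns to the category dtype.
cat_cols = ["type_of_meal_plan", "booking_status"]
df[cat_cols] = df[cat_cols].astype("category")

n_meal_plans = df["type_of_meal_plan"].cat.categories.size
```

Listing the columns explicitly, rather than selecting by dtype, keeps the sketch independent of how a given pandas version stores strings.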

Dropping duplicate values from the dataset.

71% of the bookings in the data set were not canceled, while 29% were canceled. This ratio will be maintained for modeling.

It seems 35 rows of data were entered with an incorrect arrival date of Feb 29, 2018. Since 2018 is not a leap year, this date is not valid, so these 35 rows will be deleted from the dataset.

The 35 rows containing impossible dates were deleted.

Combined the year, month, and date columns into a single arrival_year_date column with a datetime datatype.
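One possible way to handle the date validation and the column combination in a single pass is sketched below. It assumes the three source columns are named arrival_year, arrival_month, and arrival_date (hypothetical names) and relies on pd.to_datetime with errors="coerce", which turns impossible dates such as Feb 29, 2018 into NaT so they can be filtered out.

```python
import pandas as pd

# Toy data; the three column names are assumed for illustration.
df = pd.DataFrame({
    "arrival_year": [2018, 2018, 2018],
    "arrival_month": [2, 2, 3],
    "arrival_date": [28, 29, 1],   # Feb 29, 2018 does not exist
})

# pd.to_datetime can assemble dates from columns named year/month/day.
parts = df[["arrival_year", "arrival_month", "arrival_date"]].rename(
    columns={"arrival_year": "year", "arrival_month": "month", "arrival_date": "day"}
)

# errors="coerce" maps impossible dates to NaT instead of raising.
df["arrival_year_date"] = pd.to_datetime(parts, errors="coerce")

# Flag and drop the rows whose date could not be built.
invalid = df["arrival_year_date"].isna()
clean = df.loc[~invalid].reset_index(drop=True)
```

Counting invalid.sum() before dropping gives a quick check on how many rows are affected.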

Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Only 3.3% of all guests are repeated guests.

Only 4.2% of guests required parking.

77% of all bookings were made online.

We can see that offline and online reservations have, on average, a longer lead time than the other market segments, which allows more time to book another guest into the same room after a cancellation. On average, aviation has the shortest lead time for both kept and canceled reservations.

It appears that on average the cancelled rooms are booked at a higher rate than the rooms that were not cancelled across the

Both the number of adults and the number of children generally correlate positively with room price. Bookings with 3 or more children show a decrease in price, while the price rises continuously with each additional adult.

It appears there may be a slight correlation between the number of previous bookings not canceled and the repeated guest column. I will wait for confirmation from modeling before deciding which column, if any, to remove.

From the data above, we see that approximately 71% of the rooms booked were not canceled and 29% were canceled. We will maintain this ratio when we split the data for modeling.

No missing values in the dataset.

Data Preprocessing

EDA

There are a few high correlations that will be closely observed once we begin to trim the model. Number of children and room type 6 reserved have a high correlation of 0.65. Repeated guest also correlates with the Corporate market segment (0.51). There are also several strongly negative correlations: meal plan Type 1 with not selecting a meal plan (-0.87), Room Type 1 with Room Type 4 (-0.82), and the Offline market segment with the Online market segment (-0.79).

This step breaks out all the categorical variables into dummy columns and drops the booking status column from the predictors, since it is the dependent variable.
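A minimal sketch of this encoding step, assuming pandas get_dummies is the tool used and that the target column is named booking_status with the values Canceled and Not_Canceled (the names, values, and the drop_first=True choice are all assumptions for illustration):

```python
import pandas as pd

# Toy data; column names and category labels are assumed.
df = pd.DataFrame({
    "lead_time": [224, 5, 1, 211],
    "market_segment_type": ["Offline", "Online", "Online", "Corporate"],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled", "Canceled"],
})

# Separate the predictors from the dependent variable.
X = df.drop(columns=["booking_status"])
y = (df["booking_status"] == "Canceled").astype(int)  # 1 = canceled

# One-hot encode the categorical predictors. drop_first=True removes one
# level per variable so the dummies are not perfectly collinear.
X = pd.get_dummies(X, drop_first=True)
```

Dropping one level per variable matters for logistic regression with an intercept; for tree models it is optional.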

Building a Logistic Regression model

The stratify argument maintains the original distribution of classes in the target variable while splitting the data into train and test sets.
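To illustrate the effect, the sketch below splits a synthetic target with the same 71/29 class balance using scikit-learn's train_test_split and checks the cancellation rate on each side. The data, the 70/30 split size, and the random_state are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the 71% / 29% balance reported for the bookings.
y = np.array([0] * 710 + [1] * 290)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# With stratify=y, both subsets keep roughly the original class ratio.
train_rate = y_train.mean()
test_rate = y_test.mean()
```

Without stratify, the two rates can drift apart by chance, which makes train and test metrics harder to compare.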

The confusion matrix

Checking Multicollinearity

Dropping the feature with the highest p-value: Complementary.
Dropping the feature with the next highest p-value: no_of_previous_bookings_not_canceled.
Dropping the feature with the next highest p-value: no_of_previous_cancellations.
Dropping the feature with the next highest p-value: no_of_adults.
Dropping the feature with the next highest p-value: no_of_previous_bookings_not_canceled.
Dropping the feature with the next highest p-value: date.
Dropping the feature with the next highest p-value: Meal Plan 2.
Dropping the feature with the next highest p-value: Online.
Dropping the feature with the next highest p-value: Online.
Dropping the feature with the next highest p-value: Online.
Dropping the feature with the next highest p-value: Room_Type 2.

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train11 as the final ones and lg11 as the final model.

Model performance evaluation

Coefficient Interpretations:

ROC-AUC on training set

The logistic regression model gives good performance.

Precision has increased from 0.837 to 0.899.

Conclusion

Recommendations

Building a Decision Tree model

Do we need to prune the tree?

Pre-pruning was tested above; post-pruning is tested below.

Since accuracy isn't the right metric for our data, we want high recall.
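A hedged sketch of post-pruning with recall as the selection metric follows. It assumes scikit-learn's cost-complexity pruning (cost_complexity_pruning_path and the ccp_alpha parameter) is the post-pruning technique in question, and it runs on synthetic data with a similar class imbalance; the dataset, split, and random_state values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the bookings (about 29% positives).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.71, 0.29], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Candidate alphas from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Skip the last alpha (it prunes to a single node) and guard against
# tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

# Fit one tree per alpha and keep the one with the highest held-out recall.
best_alpha, best_recall = None, -1.0
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    recall = recall_score(y_test, tree.predict(X_test))
    if recall > best_recall:
        best_alpha, best_recall = alpha, recall
```

In practice the alpha would ideally be chosen on a separate validation set or by cross-validation, so the test set stays untouched for the final comparison.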

Model Performance Comparison and Conclusions

Since we will prioritize Test Recall, the best tree for this dataset would be the decision tree with post-pruning.

Actionable Insights and Recommendations

The three most important features for predicting whether a booking will be canceled are lead time, the Online market segment, and the number of special requests.